
    Clustering of categorical variables around latent variables

    In clustering, the usual aim is to group observations rather than variables. However, variable clustering arises naturally in dimension reduction, variable selection, and some case studies (sensory analysis, biochemistry, marketing, etc.). Clustering of variables arranges variables into homogeneous clusters, thereby organizing data into meaningful structures. Once the variables are grouped so that each variable is similar to the others in its cluster, a subset of variables can be selected. Several specific methods have been developed for the clustering of numerical variables, but far fewer have been proposed for categorical variables. In this paper we extend the criterion used by Vigneau and Qannari (2003) in their Clustering around Latent Variables approach for numerical variables to the case of categorical data. The homogeneity criterion of a cluster of categorical variables is defined as the sum of the correlation ratios between the categorical variables and a latent variable, which in this case is a numerical variable. We show that the latent variable maximizing the homogeneity of a cluster can be obtained with Multiple Correspondence Analysis. Different algorithms for the clustering of categorical variables are proposed: an iterative relocation algorithm, and ascendant and divisive hierarchical clustering. The proposed methodology is illustrated by a real data application to the satisfaction of pleasure craft operators.
    Keywords: clustering of categorical variables, correlation ratio, iterative relocation algorithm, hierarchical clustering
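    The homogeneity criterion described in this abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names (`correlation_ratio`, `centered_projector`, `latent_variable`), the integer coding of categories, and the toy data are all assumptions made for illustration. The sketch relies on a standard fact: for a centered, unit-norm latent variable, the sum of correlation ratios is a quadratic form in the sum of centered projectors onto each variable's category indicators, so the maximizer is the leading eigenvector, which matches the first MCA component up to scaling.

```python
import numpy as np

def correlation_ratio(codes, y):
    """eta^2: between-category sum of squares of y divided by its total sum of squares."""
    y = np.asarray(y, dtype=float)
    grand_mean = y.mean()
    total = ((y - grand_mean) ** 2).sum()
    if total == 0.0:
        return 0.0
    between = 0.0
    for c in np.unique(codes):
        group = y[codes == c]
        between += group.size * (group.mean() - grand_mean) ** 2
    return between / total

def centered_projector(codes):
    """Projector onto the span of one variable's category indicators, minus the constant part."""
    cats = np.unique(codes)
    G = (codes[:, None] == cats[None, :]).astype(float)  # n x (number of categories)
    P = G @ np.diag(1.0 / G.sum(axis=0)) @ G.T
    n = codes.shape[0]
    return P - np.ones((n, n)) / n

def latent_variable(X):
    """Return (y, homogeneity) for a cluster given as an n x p array of integer codes."""
    S = sum(centered_projector(X[:, j]) for j in range(X.shape[1]))
    eigvals, eigvecs = np.linalg.eigh(S)  # eigenvalues in ascending order
    return eigvecs[:, -1], float(eigvals[-1])

# Hypothetical toy cluster: 8 observations, 3 categorical variables.
X = np.array([[0, 0, 1], [0, 0, 0], [0, 1, 1], [1, 1, 0],
              [1, 1, 1], [1, 0, 0], [2, 1, 1], [2, 1, 0]])
y, homogeneity = latent_variable(X)
check = sum(correlation_ratio(X[:, j], y) for j in range(X.shape[1]))
print(homogeneity, check)  # the two values should agree up to rounding
```

    In a relocation or hierarchical algorithm of the kind the abstract mentions, a function like `latent_variable` would be called once per candidate cluster, and the returned homogeneity values compared or summed across clusters.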

    Rotation in Multiple Correspondence Analysis: a planar rotation iterative procedure

    Multiple Correspondence Analysis (MCA) is a well-known multivariate method for the statistical description of categorical data (see for instance Greenacre and Blasius, 2006). As in Principal Component Analysis (PCA) and Factor Analysis, the MCA solution can be rotated to increase the simplicity of the components. The idea behind a rotation is to find subsets of variables that coincide more clearly with the rotated components. Maximizing the simplicity of the components can therefore help in factor interpretation and in variable clustering. In PCA, probably the best-known rotation criterion is the varimax criterion introduced by Kaiser (1958). In addition, Kiers (1991) proposed a rotation criterion in his PCAMIX method, developed for the analysis of both numerical and categorical data and including PCA and MCA as special cases. When the data are purely categorical, this criterion is a varimax-based one relying on the correlation ratios between the categorical variables and the numerical MCA components. The criterion is then optimized with the algorithm of De Leeuw and Pruzansky (1978). In this paper, we give the analytic expression of the optimal angle of planar rotation for this criterion. If more than two principal components are to be retained, then, as Kaiser (1958) does for PCA, this planar solution is embedded in a practical algorithm that applies successive pairwise planar rotations to optimize the rotation criterion. A simulation study is used to illustrate the analytic expression of the angle for planar rotation. The proposed procedure is also applied to a real data set to show the possible benefits of using rotation in MCA.
    Keywords: categorical data, multiple correspondence analysis, correlation ratio, rotation, varimax criterion
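    The planar rotation idea can be sketched numerically. This is only an illustration under stated assumptions: the abstract does not give the criterion's formula, so the varimax-type form used in `criterion` (sum of squared correlation ratios minus, for each component, the squared column sum divided by the number of variables) is an assumed form, and the grid search in `best_angle` is a plain stand-in for the analytic angle derived in the paper. All function names and the toy data are hypothetical.

```python
import numpy as np

def correlation_ratio(codes, y):
    """eta^2: between-category sum of squares of y divided by its total sum of squares."""
    y = np.asarray(y, dtype=float)
    grand_mean = y.mean()
    total = ((y - grand_mean) ** 2).sum()
    if total == 0.0:
        return 0.0
    between = 0.0
    for c in np.unique(codes):
        group = y[codes == c]
        between += group.size * (group.mean() - grand_mean) ** 2
    return between / total

def criterion(X, Z):
    """Assumed varimax-type simplicity measure of the (variables x components) eta^2 table."""
    p = X.shape[1]
    C = np.array([[correlation_ratio(X[:, j], Z[:, k]) for k in range(Z.shape[1])]
                  for j in range(p)])
    return float((C ** 2).sum() - (C.sum(axis=0) ** 2).sum() / p)

def rotate_plane(Z, theta):
    """Rotate two component score columns by angle theta."""
    c, s = np.cos(theta), np.sin(theta)
    return Z @ np.array([[c, -s], [s, c]])

def best_angle(X, Z, steps=180):
    """Grid search over [0, pi/2); a quarter-turn only swaps the two components up to sign."""
    thetas = np.linspace(0.0, np.pi / 2, steps, endpoint=False)
    values = [criterion(X, rotate_plane(Z, t)) for t in thetas]
    i = int(np.argmax(values))
    return float(thetas[i]), values[i]

# Hypothetical toy case: two crossed binary variables, with components
# deliberately mixed at 45 degrees so that the simple structure is hidden.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
a = np.array([-1.0, -1.0, 1.0, 1.0])  # contrast carried by variable 1
b = np.array([-1.0, 1.0, -1.0, 1.0])  # contrast carried by variable 2
Z = np.column_stack([a + b, b - a]) / np.sqrt(2.0)
theta, value = best_angle(X, Z)
print(theta, value)
```

    With more than two retained components, the abstract describes applying such planar rotations successively to pairs of components; in this sketch that would amount to looping `best_angle` over column pairs until the criterion stops improving.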